Abstract

Hierarchical feature extractors such as Convolutional Networks (ConvNets) have achieved impressive performance on a variety of classification tasks using purely feedforward processing. Feedforward architectures can learn rich representations of the input space but do not explicitly model dependencies in the output spaces, which are quite structured for tasks such as articulated human pose estimation or object segmentation. Here we propose a framework that expands the expressive power of hierarchical feature extractors to encompass both input and output spaces, by introducing top-down feedback. Instead of directly predicting the outputs in one go, we use a self-correcting model that progressively changes an initial solution by feeding back error predictions, in a process we call Iterative Error Feedback (IEF). IEF shows excellent performance on the task of articulated pose estimation in the challenging MPII and LSP benchmarks, matching the state-of-the-art without requiring ground truth scale annotation.

1. Introduction 

Feature extractors such as Convolutional Networks (ConvNets) [?] represent images using a multi-layered hierarchy of features and are inspired by the structure and functionality of the visual pathway of the human brain [?]. Feature computation in these models is purely feedforward, however, unlike in the human visual system where feedback connections abound [?]. Feedback can be used to modulate and specialize feature extraction in early layers in order to model temporal and spatial context (e.g. priming [42]), to leverage prior knowledge about shape for segmentation and 3D perception, or simply for guiding visual attention to image regions relevant for the task under consideration.

Here we are interested in using feedback to build predictors that can naturally handle complex, structured output spaces. We will use as a running example the task of 2D human pose estimation [?], where the goal is to infer the 2D locations of a set of keypoints such as wrists, ankles, etc., from a single RGB image. The space of 2D human poses is highly structured because of body part proportions, left-right symmetries, interpenetration constraints, joint limits (e.g. elbows do not bend back) and physical connectivity (e.g. wrists are rigidly related to elbows), among others. Modeling this structure should make it easier to pinpoint the visible keypoints and make it possible to estimate the occluded ones.

Our main contribution is in providing a generic framework for modeling rich structure in both input and output spaces by learning hierarchical feature extractors over their joint space. We achieve this by incorporating top-down feedback: instead of trying to directly predict the target outputs, as in feedforward processing, we predict what is wrong with their current estimate and correct it iteratively. We call our framework Iterative Error Feedback, or IEF.

In IEF, a feedforward model $f$ operates on the augmented input space created by concatenating (denoted by $\oplus$) the RGB image $I$ with a visual representation $g$ of the estimated output $y_t$ to predict a "correction" ($\epsilon_t$) that brings $y_t$ closer to the ground truth output $y$. The correction signal $\epsilon_t$ is applied to the current output $y_t$ to generate $y_{t+1}$, which is converted into a visual representation by $g$ and stacked with the image to produce new inputs $x_{t+1} = I \oplus g(y_{t+1})$ for $f$, and so on iteratively. This procedure is initialized with a guess of the output ($y_0$) and is repeated until a predetermined termination criterion is met. The model is trained to produce bounded corrections at each iteration, e.g. $\|\epsilon_t\|_2 < L$. The motivation for modifying $y_t$ by a bounded amount is that the space of $x_t$ is typically highly non-linear and hence local corrections should be easier to learn. The working of our model can be mathematically described by the following equations:

$$\epsilon_t = f(x_t) \qquad (1)$$

$$y_{t+1} = y_t + \epsilon_t \qquad (2)$$

$$x_{t+1} = I \oplus g(y_{t+1}) \qquad (3)$$

where functions $f$ and $g$ have additional learned parameters $\Theta_f$ and $\Theta_g$, respectively. Although we have used the predicted error to additively modify $y_t$ in equation 2, in general $y_{t+1}$ can be a result of an arbitrary non-linear function that operates on $\epsilon_t$.

In the running example of human pose estimation, $y_t$ is a vector of retinotopic positions of all keypoints that are individually mapped by $g$ into heatmaps (i.e. $K$ heatmaps for $K$ keypoints). The heatmaps are stacked together with the image and passed as input to $f$ (see figure 1 for an overview). The "rendering" function $g$ in this particular case is not learnt; it is instead modelled as a 2D Gaussian having a fixed standard deviation and centered on the keypoint location. Intuitively, these heatmaps encode the current belief in keypoint locations in the image plane and thus form a natural representation for learning features over the joint space of body configurations and the RGB image.

The dimensionality of inputs to $f$ is $H \times W \times (K + 3)$, where $H$, $W$ represent the height and width of the image and $(K + 3)$ correspond to $K$ keypoints and the 3 color channels of the image. We model $f$ with a ConvNet with parameters $\Theta_f$ (i.e. ConvNet weights). As the ConvNet takes $I \oplus g(y_t)$ as inputs, it has the ability to learn features over the joint input-output space.

Figure 1: An implementation of Iterative Error Feedback (IEF) for 2D human pose estimation. The left panel shows the input image $I$ and the initial guess of keypoints $y_0$, represented as a set of 2D points. For the sake of illustration we show only 3 out of 17 keypoints, corresponding to the right wrist (green), left wrist (blue) and top of head (red). Consider iteration $t$: predictor $f$ receives the input $x_t$ (image $I$ stacked with a "rendering" of current keypoint positions $y_t$) and outputs a correction $\epsilon_t$. This correction is added to $y_t$, resulting in new keypoint position estimates $y_{t+1}$. The new keypoints are rendered by function $g$ and stacked with image $I$, resulting in $x_{t+1}$, and so on iteratively. Function $f$ was modeled here as a ConvNet. Function $g$ converts each 2D keypoint position into one Gaussian heatmap channel. For 3 keypoints there are 3 stacked heatmaps which are visualized as channels of a color image. In contrast to previous works, in our framework multi-layered hierarchical models such as ConvNets can learn rich models over the joint space of body configurations and images.

2. Learning


In order to infer the ground truth output ($y$), our method iteratively refines the current output ($y_t$). At each iteration, $f$ predicts a correction ($\epsilon_t$) that locally improves the current output. Note that we train the model to predict bounded corrections, but we do not enforce any such constraints at test time. The parameters ($\Theta_f$, $\Theta_g$) of functions $f$ and $g$ in our model are learnt by optimizing equation 4,

$$\min_{\Theta_f, \Theta_g} \sum_{t=1}^{T} h(\epsilon_t, e(y, y_t)) \qquad (4)$$

where $\epsilon_t$ and $e(y, y_t)$ are predicted and target bounded corrections, respectively. The function $h$ is a measure of distance, such as a quadratic loss. $T$ is the number of correction steps taken by the model. $T$ can either be chosen to be a constant or, more generally, be a function of $\epsilon_t$ (i.e. a termination condition).
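
As a rough illustration of the test-time refinement described above, the following NumPy sketch runs a fixed number of correction steps. All names (`ief_refine`, `render_points`, `make_toy_predictor`) are hypothetical and not from the paper. The toy predictor stands in for the trained ConvNet $f$: it decodes the current estimate from the rendered channels and moves halfway toward a known target, which a real model obviously could not do; it is there only to make the loop runnable.

```python
import numpy as np

def render_points(y, hw):
    """Toy stand-in for g: one channel per keypoint, 1.0 at its rounded (x, y) pixel."""
    h, w = hw
    maps = np.zeros((h, w, len(y)))
    for k, (px, py) in enumerate(np.round(y).astype(int)):
        maps[np.clip(py, 0, h - 1), np.clip(px, 0, w - 1), k] = 1.0
    return maps

def make_toy_predictor(y_true, step_fraction=0.5):
    """Toy stand-in for f (an oracle, purely illustrative)."""
    y_true = np.asarray(y_true, dtype=float)

    def predict(x):
        maps = x[..., 3:]                       # channels after the 3 RGB ones
        h, w, k = maps.shape
        flat = maps.reshape(h * w, k).argmax(axis=0)
        y_cur = np.stack([flat % w, flat // w], axis=1).astype(float)
        return step_fraction * (y_true - y_cur)

    return predict

def ief_refine(image, y0, predict, render, num_steps):
    """Iterate eps_t = f(x_t), y_{t+1} = y_t + eps_t and return every estimate."""
    y = np.asarray(y0, dtype=float)
    estimates = [y.copy()]
    for _ in range(num_steps):
        # x_t: image stacked with the rendering of the current estimate
        x = np.concatenate([image, render(y, image.shape[:2])], axis=-1)
        eps = predict(x)      # predicted correction
        y = y + eps           # additive update; no bound is enforced here
        estimates.append(y.copy())
    return estimates

image = np.zeros((32, 32, 3))
estimates = ief_refine(image, [[8, 8]], make_toy_predictor([[16, 24]]),
                       render_points, num_steps=3)
```

Note that the predictor only ever sees the stacked input, so all information about the current estimate reaches it through the rendered channels, as in the formulation above.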

We optimize this cost function using stochastic gradient descent (SGD) with every correction step being an independent training example. We grow the training set progressively: we start by learning with the samples corresponding to the first step for $N$ epochs, then add the samples corresponding to the second step and train another $N$ epochs, and so on, such that early steps get optimized longer; they get consolidated.

Algorithm 1 Learning Iterative Error Feedback with Fixed Path Consolidation

procedure FPC-Learn
    Initialize y_0
    E ← {}
    for t ← 1 to T_steps do
        for all training examples (I, y) do
            ε_t ← e(y, y_t)
        end for
        E ← E ∪ ε_t
        for j ← 1 to N do
            update Θ_f and Θ_g with SGD, using loss h and target corrections E
        end for
    end for
end procedure

As we only assume that the ground truth output ($y$) is provided at training time, it is unclear what the intermediate targets ($y_t$) should be. The simplest strategy, which we employ, is to predefine $y_t$ for every iteration using a set of fixed corrections $e(y, y_t)$ starting from $y_0$, obtaining $(y_0, y_1, \ldots, y_T)$. We call our overall learning procedure Fixed Path Consolidation (FPC), which is formally described by algorithm 1.

The target bounded corrections for every iteration are computed using a function $e(y, y_t)$, which can take different forms for different problems. If for instance the output is 1D, then $e(y, y_t) = \mathrm{sign}(y - y_t) \cdot \min(\alpha, |y - y_t|)$ would imply that the target "bounded" error will correct $y_t$ by a maximum amount of $\alpha$ in the direction of $y$.
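
To make the fixed path and the growing training set concrete, here is a minimal sketch for a 1D output. The function names are hypothetical, and the bounded correction is written as a move of at most `alpha` toward `y`; the actual SGD updates are omitted, so `fpc_schedule` only lists which correction steps contribute samples in each epoch.

```python
import numpy as np

def bounded_correction_1d(y, y_t, alpha):
    """Target correction: move toward y by at most alpha."""
    diff = y - y_t
    return np.sign(diff) * np.minimum(alpha, np.abs(diff))

def fixed_path(y, y0, alpha, num_steps):
    """Predefine y_0, ..., y_T by repeatedly applying the target corrections."""
    path = [float(y0)]
    for _ in range(num_steps):
        step = float(bounded_correction_1d(y, path[-1], alpha))
        path.append(path[-1] + step)
    return path

def fpc_schedule(num_steps, epochs_per_step):
    """Yield (epoch, steps_in_training_set) pairs: step t joins after (t - 1) * N epochs."""
    epoch = 0
    for t in range(1, num_steps + 1):
        for _ in range(epochs_per_step):
            epoch += 1
            yield epoch, list(range(1, t + 1))

path = fixed_path(y=10.0, y0=3.0, alpha=3.0, num_steps=4)
schedule = list(fpc_schedule(num_steps=2, epochs_per_step=2))
```

With this schedule, samples from the first correction step are seen in every epoch while samples from the last step are seen only in the final $N$ epochs, which is the consolidation effect described above.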

2.1. Learning Human Pose Estimation 

Human pose was represented by a set of 2D keypoint locations $y : \{y^k \in \mathbb{R}^2, k \in [1, K]\}$ where $K$ is the number of keypoints and $y^k$ denotes the $k^{th}$ keypoint. The predicted location of keypoints at the $t^{th}$ iteration has been denoted by $y_t : \{y_t^k, k \in [1, K]\}$. The rendering of $y_t$ as heatmaps concatenated with the image was provided as inputs to a ConvNet (see section 1 for details). The ConvNet was trained to predict a sequence of "bounded" corrections for each keypoint ($\epsilon_t^k$). The corrections were used to iteratively refine the keypoint locations.
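
The rendering and stacking step can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function names are hypothetical and the standard deviation `sigma=5.0` is an assumed value, since the text only states that it is fixed.

```python
import numpy as np

def render_heatmaps(keypoints, height, width, sigma):
    """Render K keypoints given as (x, y) into K Gaussian heatmaps, shape (H, W, K)."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for kx, ky in keypoints:
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2
        maps.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    return np.stack(maps, axis=-1)

def build_input(image, keypoints, sigma=5.0):
    """Stack an (H, W, 3) image with the K heatmaps, giving (H, W, K + 3)."""
    h, w = image.shape[:2]
    heatmaps = render_heatmaps(keypoints, h, w, sigma)
    return np.concatenate([image, heatmaps], axis=-1)

# Two keypoints; the second one lies to the right of the image border.
x = build_input(np.zeros((32, 48, 3)), [(10, 5), (60, 5)])
```

A keypoint outside the image still leaves a faint Gaussian tail near the border, so its channel is not identically zero.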

Let $u = y^k - y_t^k$ and the corresponding unit vector be $\hat{u} = \frac{u}{\|u\|_2}$. Then, the target "bounded" correction for the $t^{th}$ iteration and $k^{th}$ keypoint was calculated as:

$$e_k(y, y_t) = \min(L, \|u\|) \cdot \hat{u} \qquad (5)$$

where $L$ denotes the maximum displacement for each keypoint location. An interesting property of this function is that it is constant while a keypoint is far from the ground truth and varies only in scale when it is closer than $L$ to the ground truth. This simplifies the learning problem: given an image and a fixed initial pose, the model just needs to predict a constant direction in which to move keypoints, and to "slow down" motion in this direction when the keypoint becomes close to the ground truth. See fig. 2 for an illustration.

The target corrections were calculated independently for each keypoint in each example and we used an $L_2$ regression loss to model $h$ in eq. 4. We set $L$ to 20 pixels in our experiments. We initialized $y_0$ as the median of ground truth 2D keypoint locations on training images and trained a model for $T = 4$ steps, using $N = 3$ epochs for each new step. We found the fourth step to have little effect on accuracy and used 3 steps in practice at test time.
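
A small sketch of the target correction in equation 5, applied to all keypoints at once. The function name and the handling of the zero-length case (returning a zero correction when a keypoint already sits on the ground truth) are assumptions made for this example.

```python
import numpy as np

def target_correction(y, y_t, max_disp=20.0):
    """Equation 5 for each keypoint: min(L, ||u||) * u / ||u||, with u = y - y_t."""
    u = np.asarray(y, dtype=float) - np.asarray(y_t, dtype=float)
    norm = np.linalg.norm(u, axis=-1, keepdims=True)
    # Unit vector; left at zero where the keypoint is already at the target.
    unit = np.divide(u, norm, out=np.zeros_like(u), where=norm > 0)
    return np.minimum(max_disp, norm) * unit

y_true = [[100.0, 100.0], [13.0, 14.0], [5.0, 5.0]]
y_cur = [[40.0, 20.0], [10.0, 10.0], [5.0, 5.0]]
corrections = target_correction(y_true, y_cur)
```

In this example the first keypoint is 100 pixels away, so its correction is clipped to length 20 along the same direction; the second is 5 pixels away and receives the full remaining displacement.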

ConvNet architecture. We employed a standard ConvNet architecture pre-trained on Imagenet: the very deep googlenet [?]^1. We modified the filters in the first convolution layer (conv-1) to account for 17 additional channels due to 17 keypoints. In our model, the conv-1 filters operated on 20 channel inputs. The weights of the first three conv-1 channels (i.e. the ones corresponding to the image) were initialized using the weights learnt by pre-training on Imagenet. The weights corresponding to the remaining 17 channels were randomly initialized with Gaussian noise of variance 0.1. We discarded the last layer of 1000 units that predicted the Imagenet classes and replaced it with a layer containing 32 units, encoding the continuous 2D correction^2 expressed in Cartesian coordinates (the 17th "keypoint" is the location of one point anywhere inside a person, marking her, and which is provided as input both during training and testing, see section 3). We used a fixed ConvNet input size of 224 × 224.
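
The conv-1 modification can be sketched with plain NumPy arrays. The function name, the `(out_channels, in_channels, kh, kw)` filter layout and the fixed seed are assumptions for this example; the paper's experiments used MatConvNet, whose storage layout may differ. The 64 filters of size 7 × 7 below match the first GoogLeNet convolution but are used here only as an illustrative shape.

```python
import numpy as np

def extend_conv1(pretrained, extra_channels=17, noise_var=0.1, seed=0):
    """Append randomly initialized input channels to pretrained RGB filters.

    pretrained: array of shape (out_channels, 3, kh, kw).
    Returns an array of shape (out_channels, 3 + extra_channels, kh, kw).
    """
    rng = np.random.default_rng(seed)
    out_ch, _, kh, kw = pretrained.shape
    # Gaussian noise with the given variance, i.e. std = sqrt(variance).
    extra = rng.normal(0.0, np.sqrt(noise_var),
                       size=(out_ch, extra_channels, kh, kw))
    return np.concatenate([pretrained, extra], axis=1)

rgb_filters = np.ones((64, 3, 7, 7))       # placeholder for pretrained weights
conv1 = extend_conv1(rgb_filters)
```

The pretrained RGB slices are copied unchanged, so at initialization the network's response to the image channels is the same as in the pretrained model, perturbed only by the noise filters acting on the heatmap channels.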

3. Results 

We tested our method on the two most challenging benchmarks for 2D human pose estimation: the MPII Human Pose dataset, which features significant scale variation, occlusion, and multiple people interacting, and the Leeds Sports Pose dataset (LSP) [?], which features complex poses of people in sports. For each person in every image, the goal is to predict the 2D locations of all its annotated keypoints.

^1 The VGG-16 network [?] produced similar results, but required significantly more memory.

^2 Again, we do not explicitly bound the correction at test time; instead the network is taught to predict bounded corrections.

Figure 2: In our human pose estimation running example, the sequence of corrections moves keypoints along lines in the image, starting from an initial mean pose (left), all the way to the ground truth pose $y$ (right), here shown for two different images. This simplifies prediction at test time, because the desired corrections to each keypoint are constant for each image, up to the last one which is a scaled version. Feedback allows the model to detect when the solution is close and to reduce "keypoint motion", as in a control system. Linear trajectories are shown for only a subset of the keypoints, to limit clutter.

MPII - Experimental Details. Human pose is represented as a set of 16 keypoints. An additional marking-point in each person is available both for training and testing, located somewhere inside each person's boundary. We represent this point as an additional channel and stack it with the other 16 keypoint channels and the 3 RGB channels that we feed as input to a ConvNet. We used the same publicly available train/validation splits of [37]. We evaluated the accuracy of our algorithm on the validation set using the standard PCKh metric [?], and also submitted results for evaluation on the test set once, to obtain the final score.
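
For readers unfamiliar with the metric, PCKh counts a predicted keypoint as correct when it lies within a fraction (0.5 for PCKh-0.5) of the person's head segment length from the ground truth. The sketch below follows that definition only; the function name is hypothetical, and details of the official evaluation code, such as how unannotated keypoints are excluded or how head size is derived from the annotations, are not reproduced.

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """Percentage of keypoints within alpha * head size of the ground truth.

    pred, gt: arrays of shape (N, K, 2); head_sizes: shape (N,), in pixels.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)                        # (N, K)
    thresh = alpha * np.asarray(head_sizes, dtype=float)[:, None]    # (N, 1)
    return 100.0 * np.mean(dist <= thresh)

gt = [[[0.0, 0.0], [10.0, 10.0]]]
pred = [[[3.0, 4.0], [10.0, 30.0]]]
score = pckh(pred, gt, head_sizes=[20.0])
```

Here the threshold is 10 pixels: the first keypoint is 5 pixels off (correct) and the second is 20 pixels off (incorrect).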

We cropped 9 square boxes centered on the marking-point of each person, sampled uniformly over scale, from 1.4× to 0.3× of the smallest side of the image, and resized them to 256 × 256 pixels. Padding was added as necessary for obtaining these dimensions and the amount of training data was further doubled by also mirroring the images. We used the ground truth height of each person at training time, which is provided on MPII, and selected as training examples the 3 boxes for each person having a side closest to 1.2× the person height in pixels. We then trained googlenet models on random crops of 224 × 224 patches, using 6 epochs of consolidation for each of 4 steps. At test time, we predict which one of the 9 boxes is closest to 1.2× the height of the person in pixels, using a shallower model, the VGG-S ConvNet [?], trained for that task using an $L_2$ regression loss. We then align our model to the center 224 × 224 patch of the selected window. The MatConvnet library [43] was employed for these experiments.
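
The box sampling and selection rule can be sketched as below. Both function names are hypothetical, and "sampled uniformly over scale" is interpreted here as evenly spaced scale factors (`np.linspace`), which is an assumption; the cropping, padding and resizing steps are left out.

```python
import numpy as np

def candidate_box_sides(image_h, image_w, num_boxes=9,
                        max_scale=1.4, min_scale=0.3):
    """Side lengths (pixels) of square boxes, as fractions of the smallest image side."""
    scales = np.linspace(max_scale, min_scale, num_boxes)
    return scales * min(image_h, image_w)

def select_training_boxes(sides, person_height, num_keep=3, factor=1.2):
    """Indices of the boxes whose side is closest to factor * person_height."""
    gap = np.abs(sides - factor * person_height)
    order = np.argsort(gap, kind="stable")
    return np.sort(order[:num_keep])

sides = candidate_box_sides(480, 640)
chosen = select_training_boxes(sides, person_height=300)
```

For a 480 × 640 image and a person 300 pixels tall, the target side is 360 pixels, and the three boxes with sides 408, 342 and 276 pixels are kept.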

We train our models using keypoint positions for both visible and occluded keypoints, which MPII provides in many cases whenever they project on to the image (the exceptions are people truncated by the image border). We zero out the backpropagated gradients for missing keypoint annotations. Note that often keypoints lie outside the cropped image passed to the ConvNet, but this poses no issues to our formulation: keypoints outside the image can be predicted and are still visible to the ConvNet as tails of rendered Gaussians.
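
Zeroing the gradients of missing annotations amounts to masking the regression loss. The following sketch shows one way to do this for a quadratic loss on a single example; the function name and the convention of marking missing targets with NaN are assumptions made for illustration.

```python
import numpy as np

def masked_l2(pred, target, annotated):
    """Quadratic loss over annotated keypoints and its gradient w.r.t. pred.

    pred, target: arrays of shape (K, 2); annotated: boolean array of shape (K,).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(annotated, dtype=bool)[:, None]
    # Rows without annotation contribute neither loss nor gradient.
    diff = np.where(mask, pred - target, 0.0)
    loss = 0.5 * np.sum(diff ** 2)
    return loss, diff

pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[0.0, 0.0], [np.nan, np.nan]]    # second keypoint is not annotated
loss, grad = masked_l2(pred, target, annotated=[True, False])
```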

Comparison with State-of-the-Art. The standard evaluation procedure in the MPII benchmark assumes ground truth scale information is known and images are normalized using this scale information. The current state-of-the-art is the sliding-window approach of Tompson et al. [?] and IEF roughly matches this performance, as shown in table 1. In the more realistic setting of unknown scale information, the best previous result so far is from Tompson et al. [?], which was the first work to experiment with this setting and obtained 66.0 PCKh. IEF significantly improves upon this number to 81.3. Note however that the emphasis in Tompson et al.'s system was efficiency and they trained and tested their model using original image scales; searching over a multiscale image pyramid or using our automatic rescaling procedure should presumably improve their performance. See the MPII website for more detailed results.







Method                    Head  Shoulder  Elbow  Wrist   Hip  Knee  Ankle  UBody  FBody
---------------------------------------------------------------------------------------
Yang & Ramanan [48]       73.2      56.2   41.3   32.1  36.2  33.2   34.5   43.2   44.5
Pishchulin et al. [?]     74.2      49.0   40.8   34.1  36.5  34.4   35.1   41.3   44.0
Tompson et al.            96.1      91.9   83.9   77.8  80.9  72.3   64.8   84.5   82.0
IEF                       95.7      91.6   81.5   72.4  82.7  73.1   66.4   82.0   81.3
---------------------------------------------------------------------------------------
Tompson et al.            83.4      77.5   67.5   59.8  64.6  55.6   46.1   68.3   66.0
IEF                       95.5      91.6   81.5   72.4  82.7  73.1   66.9   81.9   81.3

Table 1: MPII test set PCKh-0.5 results for Iterative Error Feedback (IEF) and previous approaches, when ground truth scale information at test time is provided (top) and in the more automatic setting when it is not available (bottom). UBody and FBody stand for upper body and full body, respectively.



Figure 3: Evolution of PCKh at 0.5 overlap as a function of correction step number on the MPII-human-pose validation set, using the finetuned googlenet network. The model aligns more accurately to parts like the head and shoulders, which is natural, because these parts are easier to discriminate from the background and have more consistent appearance than limbs.

LSP - Experimental Details. In LSP, differently from MPII, images are usually tight around the person whose pose is being estimated, are resized so people have a fixed size, and have lower resolution. There is also no marking point on the torsos so we initialized the 17th keypoint used in MPII to the center of the image. The same set of keypoints is evaluated as in MPII and we trained a model using the same hyper-parameters on the extended LSP training set. We use the standard LSP evaluation code supplied with the MPII dataset and report person-centric PCP scores in table 2. Our results are competitive with the current state-of-the-art of Chen and Yuille [?].

4. Analyzing IEF

In this section, we perform extensive ablation studies to validate four choices of the IEF model: 1) proceeding iteratively instead of in a single shot, 2) predicting bounded corrections instead of directly predicting the target outputs, 3) curriculum learning of our bounded corrections, and 4) modeling the structure in the full output space (all body joints in this case) over carrying out independent predictions for each label.

Iterative v/s Direct Prediction. For evaluating the importance of progressing towards solutions iteratively we trained models to directly predict corrections to the keypoint locations in a single shot (i.e. direct prediction). Table 3 shows that IEF, which additively regresses to keypoint locations, achieves a PCKh-0.5 of 81.0, compared to a PCKh-0.5 of 74.8 achieved by directly regressing to the keypoints.

Iterative Error Feedback v/s Iterative Direct Prediction. Is iterative prediction of the error important, or does iterative prediction of the target label directly (as in e.g. [45, 41]) perform comparably? In order to answer this question we trained a model from the pretrained googlenet to iteratively predict the ground truth keypoint locations (as opposed to predicting bounded corrections). For comparing performance, we used the same number of iterations for this baseline model and IEF. Table 3 shows that IEF achieves a PCKh-0.5 of 81.0, compared to a PCKh-0.5 of 73.4 by iterative direct prediction. This can be understood by the fact that the learning problem in IEF is much easier. In IEF, for a given image, the model is trained to predict constant corrections except for the last one which is a scaled version. In iterative direct prediction, because each new pose estimate ends up somewhere around the ground truth, the model must learn to adjust directions and magnitudes in all correction steps.

Importance of Fixed Path Consolidation (FPC). The FPC method (see algorithm 1) for training an IEF model that makes $N$ corrections is a curriculum learning strategy where, in the $i^{th}$ ($i \le N$) training stage, the model is optimized for performing only the first $i$ corrections. Is this curriculum learning strategy necessary or can all the corrections be simultaneously trained? For addressing this question we trained an alternative model that trains for all corrections in all epochs. We trained IEF with and without FPC for the same number of SGD iterations and the performance of both these models is illustrated in figure 4. The figure shows that without FPC, the performance drops by almost 10 PCKh points on the validation set and that there is significant drift when performing several correction steps.


















Method                    Torso  Upper Leg  Lower Leg  Upper Arm  Forearm  Head  Total
--------------------------------------------------------------------------------------
Pishchulin et al.          88.9       64.0       58.1       45.5     35.1  85.1   58.0
Tompson et al.             90.3       70.4       61.1       63.0     51.2  83.7   66.6
Fan et al. [9]             95.4       77.7       69.8       62.8     49.1  86.6   70.1
Chen and Yuille [?]        96.0       77.2       72.2       69.7     58.1  85.6   73.6
IEF                        95.3       81.8       73.3       66.7     51.0  84.4   72.5

Table 2: Person-centric PCP scores on the LSP dataset test set for IEF and previous approaches.



Method                            Head  Shoulder  Elbow  Wrist   Hip  Knee  Ankle  UBody  FBody
-----------------------------------------------------------------------------------------------
Iterative Error Feedback (IEF)    95.2      91.8   80.8   71.5  82.3  73.7   66.4   81.4   81.0
Direct Prediction                 92.9      89.4   74.1   61.7  79.3  64.0   53.3   75.1   74.8
Iterative Direct Prediction       91.9      88.5   73.3   59.9  77.5  61.2   51.8   74.0   73.4

Table 3: PCKh-0.5 results on the MPII validation set for models finetuned from googlenet using Iterative Error Feedback (IEF), direct regression to the keypoint locations (direct prediction), and a model that was trained to iteratively predict human pose by regressing to the ground truth keypoint locations (instead of bounded corrections) in each iteration, starting from the pose in the previous iteration. The results show that our proposed approach results in significantly better performance.



Figure 4: Validation PCKh-0.5 scores for different numbers of correction steps taken, when finetuning an IEF model from a googlenet base model using stochastic gradient descent with either Fixed Path Consolidation (With FPC), or directly over all training examples (Without FPC), for the same amount of time. FPC leads to significantly more accurate results, leading to models that can perform more correction steps without drifting. It achieves this by consolidating the learning of earlier steps and progressively increasing the difficulty of the training set by adding additional correction steps.


Learning Structured Outputs. One of the major merits of IEF is supposedly that it can jointly learn the structure in input images and target outputs. For human pose estimation, IEF models the space of outputs by augmenting the image with additional input channels having Gaussian renderings centered around estimated keypoint locations. If it is the case that IEF learns priors over the appropriate relative locations of the various keypoints, then depriving the model of keypoints other than the one being predicted should decrease performance.

In order to evaluate this hypothesis we trained three different IEF models and tested how well each predicted the location of the "Left Knee" keypoint. The first model had only one input channel corresponding to the left knee, the second model had two channels corresponding to the left knee and the left hip. The third model was trained using all keypoints in the standard IEF way. The performance of these three models is reported in table 4. As a baseline, regression gets 64.6, whereas the IEF model with a single additional input channel for the left knee gets a PCKh of 69.2. This shows that feeding back the current estimate of the left knee keypoint allows for more accurate localization by itself. Furthermore, the IEF model over both left knee and left hip gets a PCKh of 72.8. This suggests that the relationship between neighboring outputs has much of the information, but modeling all joints together with the image still wins, obtaining a PCKh of 73.8.

Method                             Left Knee PCKh-0.5
-----------------------------------------------------
Direct Prediction of All Joints    64.6
IEF Left Knee                      69.2
IEF Left Knee + Left Hip           72.8
IEF All Joints                     73.8

Table 4: MPII validation PCKh-0.5 results for left knee localization when using IEF and both training and predicting different subsets of joints. We also show the result obtained using a direct prediction variant similar to plain regression on all joints (having the mean pose Gaussian maps in the input). Modeling global body structure jointly with the image leads to best results by "IEF All Joints". Interestingly, feedback seems to add value by itself and IEF on the left knee, in isolation, significantly outperforms the direct prediction baseline.

5. Related Work

There is a rich literature on structured output learning [?] (e.g. see references in [?]) but it is a relatively modern topic in conjunction with feature learning, for computer vision [?].

Here we proposed a feedback-based framework for structured-output learning. Neuroscience models of the human brain suggest that feedforward connections act as information carriers while numerous feedback connections act as modulators or competitive inhibitors to aid feature grouping [?], figure-ground segregation [?] and object recognition [46]. In computer vision, feedback has been primarily used so far for learning selective attention [?]; in [25] attention is implemented by estimating a bounding box in an image for the algorithm to process next, while in [35] attention is formed by selecting some convolutional features over others (it does not have a spatial dimension).

Stacked inference methods [?] are another related family of methods. Differently, some of these methods consider each output in isolation [?], all use different weights or learning models in each stage of inference [37], or they do not optimize for correcting their current estimates but rather attempt to predict the answer from scratch at each stage [?]. In concurrent work, Oberweger et al. [?] proposed a feedback loop for hand pose estimation from kinect data that is closely related to our approach. The autocontext work of [?] is also related and iteratively computes label heatmaps by concatenating the image with the heatmaps previously predicted. IEF is inspired by this work and we show how this iterative computation can be carried out effectively with deep ConvNet architectures, and with bounded error corrections, rather than aiming for the answer from scratch at each iteration.

Another line of work aims to inject class-specific spatial priors using coarse-to-fine processing, e.g. features arising from different layers of ConvNets were recently used for instance segmentation and keypoint prediction [?]. For pose inference, combining multiple scales [?] aids in capturing subtle long-range dependencies (e.g. distinguishing the left and right sides of the body which depend on whether a person is facing the camera). The system in our human pose estimation example can be seen as closest to approaches employing "pose-indexed features" [?], but leveraging hierarchical feature learning. Graphical models can also encode dependencies between outputs and are still popular in many applications, including human pose estimation [?].

Classic spatial alignment and warping computer vision models, such as snakes [20] and Active Appearance Models (AAMs) [6], have similar goals as the proposed IEF, but are not learned end-to-end (or learned at all), employ linear shape models and hand designed features, and require slower gradient computation which often takes many iterations before convergence. They can get stuck in poor local minima even for constrained variation (AAMs and small out-of-plane face rotations). IEF, on the other hand, is able to minimize over rich articulated human 3D pose variation, starting from a mean shape. Although extensions that use learning to drive the optimization have been proposed [47], typically these methods still require manually defined energy functions to measure goodness of fit.

6. Conclusions 

While standard ConvNets offer hierarchical representations that can capture the patterns of images at multiple levels of abstraction, the outputs are typically modeled as flat image or pixel-level 1-of-K labels, or slightly more complicated hand-designed representations. We aimed in this paper to mitigate this asymmetry by introducing Iterative Error Feedback (IEF), which extends hierarchical representation learning to output spaces, while leveraging at heart the same machinery. IEF works by, in broad terms, moving the emphasis from the problem of predicting the state of the external world to one of correcting the expectations about it, which is achieved by introducing a simple feedback connection in standard models.

In our pose estimation working example we opted for feeding pose information only into the first layer of the ConvNet for the sake of simplicity. This information may also be helpful for mid-level layers, so as to modulate not only edge detection, but also processes such as junction detection or contour completion which advanced feature extractors may need to compute. We have also so far only experimented with feeding back "images" made up of Gaussian distributions. There may be more powerful ways to render top-down pose information using parametrized computational blocks (e.g. deconvolution) that can then be learned jointly with the rest of the model parameters using standard backpropagation. This is desirable in order to attack problems with higher-dimensional output spaces such as 3D human pose estimation [32, 33] or segmentation.

Figure 5: Example poses obtained using the proposed method IEF on the MPII validation set. From left to right we show the sequence of corrections the method makes; on the right is the ground truth pose, including annotated occluded keypoints, which are not evaluated. Note that IEF is robust to left-right ambiguities and is able to rotate the initial pose by up to 180 degrees (first and fifth row), can align across occlusions (second and third rows) and can handle scale variation (second, fourth and fifth rows) and truncation (fifth row). The bottom two rows show failure cases. In the first one, the predicted configuration captures the gist of the pose but is misaligned and not scaled properly. The second case shows several people closely interacting and the model aligns to the wrong person. The black borders show padding. Best seen in color and with zoom.